[core] Fix silent data corruption due to overflow in FieldProductAgg#7906
[core] Fix silent data corruption due to overflow in FieldProductAgg#7906ArnavBalyan wants to merge 1 commit into
Conversation
|
cc @JingsongLi thanks! :) |
|
cc @leaves12138 @JingsongLi can you pls review thanks! |
JingsongLi
left a comment
There was a problem hiding this comment.
Review of FieldProductAgg overflow detection
The intent is good — silent integer overflow during aggregation is indeed a correctness bug that corrupts data. The agg() fix for INTEGER/BIGINT using Math.multiplyExact is clean. The toExactByte/toExactShort helpers are correct since Java promotes byte * byte and short * short to int before the multiplication, so the int result can be range-checked without itself overflowing.
Issues
1. retract() method has the same overflow bug but is not fixed
paimon-core/src/main/java/org/apache/paimon/mergetree/compact/aggregate/FieldProductAgg.java
The retract method performs division, which can also overflow silently in the same cases:
(byte) ((byte) Byte.MIN_VALUE / (byte) -1)→ result is 128, which wraps to -128 when cast to byte(short) ((short) Short.MIN_VALUE / (short) -1)→ result is 32768, wraps to -32768(int) Integer.MIN_VALUE / (int) -1→ wraps toInteger.MIN_VALUEin Java(long) Long.MIN_VALUE / (long) -1→ wraps toLong.MIN_VALUEin Java
If the goal of this PR is to eliminate silent data corruption from overflow, the retract() path should get the same treatment for consistency. At minimum, the TINYINT and SMALLINT cases need the same range-check pattern since the int promotion makes the overflow detectable. For INTEGER and BIGINT, you would need explicit checks (if (accumulator == Integer.MIN_VALUE && inputField == -1) etc.) since Java's / operator does not throw.
2. Exception message could include operand values for debuggability
When a compaction task fails with ArithmeticException("byte overflow"), the operator has no idea which values caused it. Consider including the operands:
throw new ArithmeticException(
String.format("byte overflow: %d * %d = %d", (byte) accumulator, (byte) inputField, value));This is especially important because this exception will surface during compaction, potentially long after the data was written, making reproduction difficult.
3. Consider the failure mode during compaction
An unchecked ArithmeticException thrown during compaction will fail the compaction task. Depending on the retry/error-handling policy, this could block compaction indefinitely for a partition that has overflowing values. This is arguably better than silent corruption, but it would be worth documenting this behavioral change (e.g., in the PR description or release notes) so users understand that previously-silent overflows will now cause task failures.
Minor notes
- The test cases are well chosen (boundary values, both positive and negative overflow). Good coverage.
- FLOAT/DOUBLE overflow to infinity is not addressed, which is fine — that is standard IEEE 754 semantics and expected for floating-point types.
FieldSumAgghas the same class of overflow bugs, but that is out of scope for this PR.
Overall this is a useful correctness fix. The main request is to apply the same protection to retract() for completeness, since a half-fix leaves an inconsistency where agg() detects overflow but retract() on the same type silently corrupts.
53057d1 to
0885945
Compare
|
Thanks for the review! Have addressed all comments:
|
JingsongLi
left a comment
There was a problem hiding this comment.
Review of FieldProductAgg overflow detection
Overall: The PR addresses a real correctness issue — integer multiplication/division overflow silently wrapping and corrupting aggregation results. The approach is sound and the fix is complete across both agg() and retract() paths.
What's good
- Correctness: The overflow detection logic is correct for all four integer types. byte/short use int-promoted arithmetic with range checks (safe because
intcan hold the full product/quotient range). int/long useMath.multiplyExactfor multiply and explicitMIN_VALUE / -1guards for divide. - Error messages: Include operand values, which is critical for debugging compaction failures.
- Test coverage: Tests cover multiply overflow on all 4 integer types + retract (division) overflow on all 4 types, with both positive and negative overflow cases.
- Float/Double left alone: Correct — IEEE 754 overflow to infinity is expected behavior.
Minor suggestions
1. multiplyExactInt/multiplyExactLong — catch-rethrow pattern loses the original exception cause
} catch (ArithmeticException e) {
throw new ArithmeticException(String.format("int overflow: %d * %d", a, b));
}The new exception doesn't chain e as a cause. ArithmeticException lacks a cause constructor, but initCause() works:
ArithmeticException ex = new ArithmeticException(String.format("int overflow: %d * %d", a, b));
ex.initCause(e);
throw ex;2. divideExactByte/divideExactShort — consider aligning with int/long pattern for consistency
The only case where integer division can overflow is MIN_VALUE / -1. The general value > MAX || value < MIN range check works correctly, but you could match the explicit MIN_VALUE && -1 check pattern used by divideExactInt/divideExactLong for consistency across all four types.
3. Behavioral change during compaction
An ArithmeticException during compaction will crash the task. This is better than silent corruption, but for existing tables that already contain overflowing data, upgrading Paimon could cause compaction to fail on those partitions with no easy recovery path. Worth noting in release notes.
4. FieldSumAgg has the same class of bug
FieldSumAgg also does unchecked (byte) accumulator + (byte) inputField etc. with the same silent overflow risk. Out of scope for this PR, but worth opening a follow-up issue.
Purpose
Tests